Topic models, such as latent Dirichlet allocation (LDA), can be 
useful tools for the statistical analysis of document collections and 
other discrete data. The LDA model assumes that the words of each 
document arise from a mixture of topics, each of which is a distri- 
bution over the vocabulary. A limitation of LDA is the inability to 
model topic correlation even though, for example, a document about 
genetics is more likely to also be about disease than X-ray astron- 
omy. This limitation stems from the use of the Dirichlet distribution 
to model the variability among the topic proportions. In this paper 
we develop the correlated topic model (CTM), where the topic pro- 
portions exhibit correlation via the logistic normal distribution [J. 
Roy. Statist. Soc. Ser. B 44 (1982) 139-177]. We derive a fast varia- 
tional inference algorithm for approximate posterior inference in this 
model, which is complicated by the fact that the logistic normal is 
not conjugate to the multinomial. We apply the CTM to the articles 
from Science published from 1990-1999, a data set that comprises 
57M words. The CTM gives a better fit of the data than LDA, and 
we demonstrate its use as an exploratory tool of large document col- 
lections. 

1. Introduction. Large collections of documents are readily available on- 
line and widely accessed by diverse communities. As a notable example, 
scholarly articles are increasingly published in electronic form, and histor- 
ical archives are being scanned and made accessible. The not-for-profit or- 
ganization JSTOR (www.jstor.org) is currently one of the leading providers 
of journals to the scholarly community. These archives are created by scan- 
ning old journals and running an optical character recognizer over the pages. 
JSTOR provides the original scans on-line, and uses their noisy version of 
the text to support keyword search. Since the data are largely unstructured 
and comprise millions of articles spanning centuries of scholarly work, au- 
tomated analysis is essential. The development of new tools for browsing, 
searching and allowing the productive use of such archives is thus an impor- 
tant technological challenge, and provides new opportunities for statistical 
modeling. 

The statistical analysis of documents has a tradition that goes back at 
least to the analysis of the Federalist papers by Mosteller and Wallace [21]. 
But document modeling takes on new dimensions with massive multi-author 
collections such as the large archives that now are being made accessible 
by JSTOR, Google and other enterprises. In this paper we consider topic 
models of such collections, by which we mean latent variable models of doc- 
uments that exploit the correlations among the words and latent semantic 
themes. Topic models can extract surprisingly interpretable and useful struc- 
ture without any explicit "understanding" of the language by computer. In 
this paper we present the correlated topic model (CTM), which explicitly 
models the correlation between the latent topics in the collection, and en- 
ables the construction of topic graphs and document browsers that allow a 
user to navigate the collection in a topic-guided manner. 

The main application of this model that we present is an analysis of the 
JSTOR archive for the journal Science. This journal was founded in 1880 
by Thomas Edison and continues as one of the most influential scientific 
journals today. The variety of material in the journal, as well as the large 
number of articles ranging over more than 100 years, demonstrates the need 
for automated methods, and the potential for statistical topic models to 
provide an aid for navigating the collection. 

The correlated topic model builds on the earlier latent Dirichlet allocation 
(LDA) model of Blei, Ng and Jordan [8], which is an instance of a general 
family of mixed membership models for decomposing data into multiple la- 
tent components. LDA assumes that the words of each document arise from 
a mixture of topics, where each topic is a multinomial over a fixed word 
vocabulary. The topics are shared by all documents in the collection, but 
the topic proportions vary stochastically across documents, as they are ran- 
domly drawn from a Dirichlet distribution. Recent work has used LDA as a 
building block in more sophisticated topic models, such as author-document 
models [19, 24], abstract-reference models [12], syntax-semantics models [16] 
and image-caption models [6]. The same kind of modeling tools have also 
been used in a variety of nontext settings, such as image processing [13, 26], 
recommendation systems [18], the modeling of user profiles [14] and the 
modeling of network data [1]. Similar models were independently developed 
for disability survey data [10, 11] and population genetics [22]. 

In the parlance of the information retrieval literature, LDA is a "bag of 
words" model. This means that the words of the documents are assumed to 
be exchangeable within them, and Blei, Ng and Jordan [8] motivate LDA 
from this assumption with de Finetti's exchangeability theorem. As a con- 
sequence, models like LDA represent documents as vectors of word counts 
in a very high dimensional space, ignoring the order in which the words ap- 
pear. While it is important to retain the exact sequence of words for reading 
comprehension, the linguistically simplistic exchangeability assumption is 
essential to efficient algorithms for automatically eliciting the broad seman- 
tic themes in a collection. 

The starting point for our analysis here is a perceived limitation of topic 
models such as LDA: they fail to directly model correlation between topics. 
In most text corpora, it is natural to expect that subsets of the underlying 
latent topics will be highly correlated. In Science, for instance, an article 
about genetics may be likely to also be about health and disease, but un- 
likely to also be about X-ray astronomy. For the LDA model, this limitation 
stems from the independence assumptions implicit in the Dirichlet distri- 
bution on the topic proportions. Under a Dirichlet, the components of the 
proportions vector are nearly independent, which leads to the strong and 
unrealistic modeling assumption that the presence of one topic is not corre- 
lated with the presence of another. The CTM replaces the Dirichlet by the 
more flexible logistic normal distribution, which incorporates a covariance 
structure among the components [4]. This gives a more realistic model of 
the latent topic structure where the presence of one latent topic may be 
correlated with the presence of another. 

However, the ability to model correlation between topics sacrifices some 
of the computational conveniences that LDA affords. Specifically, the con- 
jugacy between the multinomial and Dirichlet facilitates straightforward 
approximate posterior inference in LDA. That conjugacy is lost when the 
Dirichlet is replaced with a logistic normal. Standard simulation techniques 
such as Gibbs sampling are no longer possible, and Metropolis-Hastings 
based MCMC sampling is prohibitive due to the scale and high dimension 
of the data. 

Thus, we develop a fast variational inference procedure for carrying out 
approximate inference with the CTM. Variational inference [17, 29] 
trades the unbiased estimates of MCMC procedures for potentially biased 
but computationally efficient algorithms whose numerical convergence is 
easy to assess. Variational inference algorithms have been effective in LDA 
for analyzing large document collections [8]. We extend their use to the 
CTM. 

The rest of this paper is organized as follows. We first present the corre- 
lated topic model and discuss its underlying modeling assumptions. Then, 
we present an outline of the variational approach to inference (the techni- 
cal details are collected in the Appendix) and the variational expectation- 
maximization procedure for parameter estimation. Finally, we analyze the 
performance of the model on the JSTOR Science data. Quantitatively, we 
show that it gives a better fit than LDA, as measured by the accuracy of the 
predictive distributions over held out documents. Qualitatively, we present 
an analysis of all of Science from 1990-1999, including examples of topically 
related articles found using the inferred latent structure, and topic graphs 
that are constructed from a sparse estimate of the covariance structure of 
the model. The paper concludes with a brief discussion of the results and 
future work that it suggests. 

2. The correlated topic model. The correlated topic model (CTM) is a 
hierarchical model of document collections. The CTM models the words of 
each document as draws from a mixture model. The mixture components are shared 
by all documents in the collection; the mixture proportions are document- 
specific random variables. The CTM allows each document to exhibit mul- 
tiple topics with different proportions. It can thus capture the heterogeneity 
in grouped data that exhibit multiple latent patterns. 

We use the following terminology and notation to describe the data, latent 
variables and parameters in the CTM. 

• Words and documents. The only observable random variables that we 
consider are words that are organized into documents. Let $W_{d,n}$ denote 
the $n$th word in the $d$th document, which is an element in a $V$-term 
vocabulary. Let $W_d$ denote the vector of $N_d$ words associated with 
document $d$. 

• Topics. A topic $\beta$ is a distribution over the vocabulary, a point on the 
$V - 1$ simplex. The model will contain $K$ topics $\beta_{1:K}$. 

• Topic assignments. Each word is assumed drawn from one of the $K$ 
topics. The topic assignment $Z_{d,n}$ is associated with the $n$th word and 
$d$th document. 

• Topic proportions. Finally, each document is associated with a set of topic 
proportions $\theta_d$, which is a point on the $K - 1$ simplex. Thus, $\theta_d$ is a 
distribution over topic indices, and reflects the probabilities with which words 
are drawn from each topic in the collection. We will typically consider a 
natural parameterization of this multinomial, $\eta = \log(\theta_i / \theta_K)$. 

Specifically, the correlated topic model assumes that an $N$-word document 
arises from the following generative process. Given topics $\beta_{1:K}$, a 
$K$-vector $\mu$ and a $K \times K$ covariance matrix $\Sigma$: 

1. Draw $\eta_d \mid \{\mu, \Sigma\} \sim N(\mu, \Sigma)$. 

2. For $n \in \{1, \ldots, N_d\}$: 

(a) Draw topic assignment $Z_{d,n} \mid \eta_d$ from $\mathrm{Mult}(f(\eta_d))$. 

(b) Draw word $W_{d,n} \mid \{z_{d,n}, \beta_{1:K}\}$ from $\mathrm{Mult}(\beta_{z_{d,n}})$, 

where $f(\eta)$ maps a natural parameterization of the topic proportions to the 
mean parameterization, 

(1)  $\theta = f(\eta) = \frac{\exp\{\eta\}}{\sum_i \exp\{\eta_i\}}$. 

(Note that $\eta$ does not index a minimal exponential family. Adding any 
constant to $\eta$ will result in an identical mean parameter.) This process is 
illustrated as a probabilistic graphical model in Figure 1. (A probabilistic 
graphical model is a graph representation of a family of joint distributions. 
Nodes denote random variables; edges denote possible dependencies between 
them.) 

Fig. 1. Top: Probabilistic graphical model representation of the correlated topic model. 
The logistic normal distribution, used to model the latent topic proportions of a document, 
can represent correlations between topics that are impossible to capture using a Dirichlet. 
Bottom: Example densities of the logistic normal on the 2-simplex. From left: diagonal 
covariance and nonzero-mean, negative correlation between topics 1 and 2, positive 
correlation between topics 1 and 2. 
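The generative process above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names, the toy topics `beta`, and the values of `mu` and `sigma` are all assumptions made for the example, and only the Python standard library is used.

```python
import math
import random


def softmax(eta):
    """The map f: natural parameters eta -> topic proportions theta."""
    m = max(eta)  # subtract the max for numerical stability
    exps = [math.exp(e - m) for e in eta]
    total = sum(exps)
    return [e / total for e in exps]


def cholesky(sigma):
    """Lower-triangular L with L L^T = sigma (sigma must be positive definite)."""
    k = len(sigma)
    low = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = sum(low[i][m] * low[j][m] for m in range(j))
            if i == j:
                low[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                low[i][j] = (sigma[i][j] - s) / low[j][j]
    return low


def sample_document(beta, mu, sigma, n_words, rng):
    """Draw (theta, z, w) for one document; topics and words are integer indices."""
    k = len(mu)
    low = cholesky(sigma)
    noise = [rng.gauss(0.0, 1.0) for _ in range(k)]
    # Step 1: eta ~ N(mu, sigma), computed as mu + L * noise.
    eta = [mu[i] + sum(low[i][j] * noise[j] for j in range(i + 1)) for i in range(k)]
    theta = softmax(eta)
    z, w = [], []
    for _ in range(n_words):
        # Step 2(a): topic assignment from Mult(f(eta)).
        topic = rng.choices(range(k), weights=theta)[0]
        # Step 2(b): word from the distribution of the chosen topic.
        word = rng.choices(range(len(beta[topic])), weights=beta[topic])[0]
        z.append(topic)
        w.append(word)
    return theta, z, w


# Hypothetical toy model: K = 2 topics over a V = 3 term vocabulary.
rng = random.Random(0)
beta = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
mu = [0.0, 0.0]
sigma = [[1.0, 0.5], [0.5, 1.0]]
theta, z, w = sample_document(beta, mu, sigma, n_words=20, rng=rng)
```

Because `softmax` subtracts a constant before exponentiating, it also makes the non-minimality remark concrete: shifting every component of `eta` by the same amount leaves `theta` unchanged.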

The CTM builds on the latent Dirichlet allocation (LDA) model [8]. La- 
tent Dirichlet allocation assumes a nearly identical generative process, but 
one where the topic proportions are drawn from a Dirichlet. In LDA and 
its variants, the Dirichlet is a computationally convenient distribution over 
topic proportions because it is conjugate to the topic assignments. But, the 
Dirichlet assumes near independence of the components of the proportions. 
In fact, one can simulate a draw from a Dirichlet by drawing from K inde- 
pendent Gamma distributions and normalizing the resulting vector. (Note 
that there is slight negative correlation due to the constraint that the 
components sum to one.) 

Rather than use a Dirichlet, the CTM draws a real valued random vector 
from a multivariate Gaussian and then maps it to the simplex to obtain 
a multinomial parameter. This is the defining characteristic of the logistic 
normal distribution [2, 3, 4]. The covariance of the Gaussian induces 
dependencies between the components of the transformed random simplicial 
vector, allowing for a general pattern of variability between its components. 
The logistic normal was originally studied in the context of analyzing ob- 
served compositional data, such as the proportions of minerals in geological 
samples. In the CTM, we use it to model the latent composition of topics 
associated with each document. 
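The contrast between the two priors can be checked by simulation. The sketch below draws Dirichlet vectors by normalizing independent Gamma variables and draws logistic normal vectors by mapping a Gaussian to the simplex, then compares the empirical correlation between the first two proportions. The dimensions, the symmetric Dirichlet parameter and the correlation value 0.9 are illustrative assumptions, not settings from the paper.

```python
import math
import random


def dirichlet_draw(alpha, rng):
    """Dirichlet draw via independent Gamma variables, then normalization."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]


def logistic_normal_draw(r, rng):
    """Three-topic logistic normal draw. eta_0 and eta_1 have correlation r,
    eta_2 is independent, and all have unit variance (an illustrative choice)."""
    z0, z1, z2 = (rng.gauss(0.0, 1.0) for _ in range(3))
    eta = [z0, r * z0 + math.sqrt(1.0 - r * r) * z1, z2]
    m = max(eta)
    exps = [math.exp(e - m) for e in eta]
    total = sum(exps)
    return [e / total for e in exps]


def correlation(xs, ys):
    """Empirical Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


rng = random.Random(1)
dir_draws = [dirichlet_draw([1.0] * 10, rng) for _ in range(20000)]
ln_draws = [logistic_normal_draw(0.9, rng) for _ in range(20000)]
dir_corr = correlation([d[0] for d in dir_draws], [d[1] for d in dir_draws])
ln_corr = correlation([d[0] for d in ln_draws], [d[1] for d in ln_draws])
print(f"Dirichlet: {dir_corr:+.3f}   logistic normal: {ln_corr:+.3f}")
```

Under the Dirichlet the correlation between two proportions can only be negative, whereas the Gaussian covariance lets the logistic normal produce a clearly positive one.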

The drawback of using the logistic normal is that it is not conjugate to 
the multinomial, which complicates the corresponding approximate posterior 
inference procedure. The advantage, however, is that it provides a more 
expressive document model. The strong independence assumption imposed 
by the Dirichlet is not realistic when analyzing real document collections, 
where one finds strong correlations between the latent topics. For example, 
a document about geology is more likely to also be about archeology than 
genetics. We aim to use the covariance matrix of the logistic normal to 
capture such relationships. 

In Section 4 we illustrate how the higher order structure given by the 
covariance can be used as an exploratory tool for better understanding and 
navigating a large corpus of documents. Moreover, modeling correlation can 
lead to better predictive distributions. In some applications, such as auto- 
matic recommendation systems, the goal is to predict unseen items condi- 
tioned on a set of observations. A Dirichlet-based model will predict items 
based on the latent topics that the observations suggest, but the CTM will 
predict items associated with additional topics that are correlated with the 
conditionally probable topics. 

3. Computation with the correlated topic model. We address two com- 
putational problems that arise when using the correlated topic model to 
analyze data. First, given a collection of topics and distribution over topic 
proportions $\{\beta_{1:K}, \mu, \Sigma\}$, we estimate the posterior distribution of the latent 
variables conditioned on the words of a document $p(\eta, z \mid w, \beta_{1:K}, \mu, \Sigma)$. This 
lets us embed newly observed documents into the low dimensional latent the- 
matic space that the model represents. We use a fast variational inference 
algorithm to approximate this posterior, which lets us quickly analyze large 
document collections under these complicated modeling assumptions. 

Second, given a collection of documents $\{w_1, \ldots, w_D\}$, we find maximum 
likelihood estimates of the topics and the underlying logistic normal distri- 
bution under the modeling assumptions of the CTM. We use a variant of the 
expectation-maximization algorithm, where the E-step is the per-document 
posterior inference problem described above. Furthermore, we seek sparse 
solutions of the inverse covariance matrix between topics, and we adapt 
$\ell_1$-regularized covariance estimation [20] for this purpose. 

3.1. Posterior inference with variational methods. Given a document $w$ 
and a model $\{\beta_{1:K}, \mu, \Sigma\}$, the posterior distribution of the per-document 
latent variables is 

(2)  $p(\eta, z \mid w, \beta_{1:K}, \mu, \Sigma) = \frac{p(\eta \mid \mu, \Sigma) \prod_{n=1}^{N} p(z_n \mid \eta)\, p(w_n \mid z_n, \beta_{1:K})}{\int p(\eta \mid \mu, \Sigma) \prod_{n=1}^{N} \sum_{z_n=1}^{K} p(z_n \mid \eta)\, p(w_n \mid z_n, \beta_{1:K})\, d\eta}$, 

which is intractable to compute due to the integral in the denominator, that 
is, the marginal probability of the document that we are conditioning on. 
There are two reasons for this intractability. First, the sum over the $K$ 
values of $z_n$ occurs inside the product over words, inducing a combinatorial 
number of terms. Second, even if $K^N$ stays within the realm of 
computational tractability, the distribution of topic proportions $p(\eta \mid \mu, \Sigma)$ is not 
conjugate to the distribution of topic assignments $p(z_n \mid \eta)$. Thus, we cannot 
analytically compute the integrals of each term. 

The nonconjugacy further precludes using many of the Markov chain Monte 
Carlo (MCMC) sampling techniques that have been developed for computing 
with Dirichlet-based mixed membership models [10, 15]. These MCMC 
methods are all based on Gibbs sampling, where the conjugacy between the 
latent variables lets us compute coordinate-wise posteriors analytically. To 
employ MCMC in the logistic normal setting considered here, we have to 
appeal to a tailored Metropolis-Hastings solution. Such a technique will not 
enjoy the same convergence properties and speed of the Gibbs samplers, 
which is particularly hindering for the goal of analyzing collections that 
comprise millions of words. 

Thus, to approximate this posterior, we appeal to variational methods as 
a deterministic alternative to MCMC. The idea behind variational methods 
is to optimize the free parameters of a distribution over the latent variables 
so that the distribution is close in Kullback-Leibler divergence to the true 
posterior [17, 29]. The fitted variational distribution is then used as a substi- 
tute for the posterior, just as the empirical distribution of samples is used in 
MCMC. Variational methods have had widespread application in machine 
learning; their potential in applied Bayesian statistics is beginning to be 
realized. 

In models composed of conjugate-exponential family pairs and mixtures, 
the variational inference algorithm can be automatically derived by comput- 
ing expectations of natural parameters in the variational distribution [5, 7, 
31]. However, the nonconjugate pair of variables in the CTM requires that 
we derive the variational inference algorithm from first principles. 

We begin by using Jensen's inequality to bound the log probability of a 
document, 

(3)  $\log p(w_{1:N} \mid \mu, \Sigma, \beta) \geq E_q[\log p(\eta \mid \mu, \Sigma)] + \sum_{n=1}^{N} E_q[\log p(z_n \mid \eta)] + \sum_{n=1}^{N} E_q[\log p(w_n \mid z_n, \beta)] + H(q)$, 

where the expectation is taken with respect to q, a variational distribution 
of the latent variables, and H(q) denotes the entropy of that distribution. 
As a variational distribution, we use a fully factorized model, where all the 
variables are independently governed by a different distribution, 

(4)  $q(\eta_{1:K}, z_{1:N} \mid \lambda_{1:K}, \nu^2_{1:K}, \phi_{1:N}) = \prod_{i=1}^{K} q(\eta_i \mid \lambda_i, \nu_i^2) \prod_{n=1}^{N} q(z_n \mid \phi_n)$. 

The variational distributions of the discrete topic assignments $z_{1:N}$ are 
specified by the $K$-dimensional multinomial parameters $\phi_{1:N}$ (these are mean 
parameters of the multinomial). The variational distribution of the continuous 
variables $\eta_{1:K}$ is a product of $K$ independent univariate Gaussians with 
parameters $\{\lambda_i, \nu_i^2\}$. Since the variational parameters are fit using a 
single observed document $w_{1:N}$, there is no advantage in introducing a 
nondiagonal variational covariance matrix. 

The variational inference algorithm optimizes equation (3) with respect 
to the variational parameters, thereby tightening the bound on the marginal 
probability of the observations as much as the structure of variational distri- 
bution allows. This is equivalent to finding the variational distribution that 
minimizes $\mathrm{KL}(q \| p)$, where $p$ is the true posterior [17, 29]. Details of this 
optimization for the CTM are given in the Appendix. 
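To make the four terms of the bound in equation (3) concrete, the sketch below evaluates them for one document under the factorized distribution of equation (4). Three terms have closed forms; the expectation of the log normalizer, $E_q[\log \sum_i \exp \eta_i]$, does not, and the source defers its treatment to the Appendix. This sketch swaps in a plain Monte Carlo average for that term, so it only evaluates the bound and does not reproduce the authors' optimization. The function name `ctm_bound`, the choice to pass the precision matrix $\Sigma^{-1}$ and $\log|\Sigma|$ directly, and the toy inputs are assumptions made for illustration.

```python
import math
import random


def log_sum_exp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctm_bound(words, beta, mu, precision, log_det_sigma, lam, nu2, phi,
              n_samples=2000, rng=None):
    """Evaluate the right-hand side of equation (3) for a single document.

    words: list of term indices; beta[i][v]: topic distributions;
    precision: inverse of Sigma; lam, nu2: Gaussian variational means/variances;
    phi[n][i]: variational multinomial for the n-th word.
    """
    rng = rng or random.Random(0)
    k = len(mu)
    # E_q[log p(eta | mu, Sigma)], closed form for a Gaussian q.
    diff = [lam[i] - mu[i] for i in range(k)]
    quad = sum(diff[i] * precision[i][j] * diff[j] for i in range(k) for j in range(k))
    trace = sum(nu2[i] * precision[i][i] for i in range(k))
    e_log_p_eta = -0.5 * (k * math.log(2 * math.pi) + log_det_sigma + trace + quad)
    # E_q[log sum_i exp(eta_i)], estimated by Monte Carlo (an assumption of this sketch).
    sd = [math.sqrt(v) for v in nu2]
    e_lse = sum(log_sum_exp([rng.gauss(lam[i], sd[i]) for i in range(k)])
                for _ in range(n_samples)) / n_samples
    # sum_n E_q[log p(z_n | eta)] = sum_n (phi_n . lam - E_q[log-normalizer]).
    e_log_p_z = sum(sum(p[i] * lam[i] for i in range(k)) - e_lse for p in phi)
    # sum_n E_q[log p(w_n | z_n, beta)].
    e_log_p_w = sum(p[i] * math.log(beta[i][w])
                    for w, p in zip(words, phi) for i in range(k) if p[i] > 0)
    # H(q): Gaussian entropies plus multinomial entropies.
    entropy = sum(0.5 * math.log(2 * math.pi * math.e * v) for v in nu2)
    entropy -= sum(x * math.log(x) for p in phi for x in p if x > 0)
    return e_log_p_eta + e_log_p_z + e_log_p_w + entropy


# Hypothetical toy inputs: K = 2 topics, V = 3 terms, identity covariance.
words = [0, 2, 2, 1]
beta = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
phi = [[0.9, 0.1], [0.2, 0.8], [0.2, 0.8], [0.5, 0.5]]
bound = ctm_bound(words, beta, mu=[0.0, 0.0], precision=[[1.0, 0.0], [0.0, 1.0]],
                  log_det_sigma=0.0, lam=[0.0, 0.5], nu2=[0.5, 0.5], phi=phi)
```

A useful sanity check is the single-topic case: with $K = 1$ the document probability is a fixed product of word probabilities, and the bound falls below it by exactly the Kullback-Leibler divergence between the variational Gaussian and the prior.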

Note that variational methods do not come with the same theoretical 
guarantees as MCMC, where the limiting distribution of the chain is ex- 
actly the posterior of interest. However, variational methods provide fast 
algorithms and a clear convergence criterion, whereas MCMC methods can 
be computationally inefficient and determining when a Markov chain has 
converged is difficult [23]. 

3.2. Parameter estimation. Given a collection of documents, we carry 
out parameter estimation for the correlated topic model by attempting to 
maximize the likelihood of a corpus of documents as a function of the topics 
$\beta_{1:K}$ and the multivariate Gaussian $(\mu, \Sigma)$. 

As in many latent variable models, we cannot compute the marginal 
likelihood of the data because of the latent structure that needs to be 
marginalized out. To address this issue, we use variational expectation-maximization 
(EM). In the E-step of traditional EM, one computes the posterior distri- 
bution of the latent variables given the data and current model parameters. 
In variational EM, we use the variational approximation to the posterior 
described in the previous section. Note that this is akin to Monte Carlo EM, 
where the E-step is approximated by a Monte Carlo approximation to the 
posterior [30]. 

Specifically, the objective function of variational EM is the likelihood 
bound given by summing equation (3) over the document collection $\{w_1, \ldots, w_D\}$, 

$\mathcal{L}(\mu, \Sigma, \beta_{1:K}; w_{1:D}) \geq \sum_{d=1}^{D} E_{q_d}[\log p(\eta_d, z_d, w_d \mid \mu, \Sigma, \beta_{1:K})] + H(q_d)$. 

The variational EM algorithm is coordinate ascent in this objective func- 
tion. In the E-step, we maximize the bound with respect to the variational 
parameters by performing variational inference for each document. In the 
M-step, we maximize the bound with respect to the model parameters. This 
amounts to maximum likelihood estimation of the topics and multivariate 
Gaussian using expected sufficient statistics, where the expectation is taken 
with respect to the variational distributions computed in the E-step, 

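$\hat{\beta}_i \propto \sum_d \phi_{d,i}\, n_d$, 

$\hat{\mu} = \frac{1}{D} \sum_d \lambda_d$, 

$\hat{\Sigma} = \frac{1}{D} \sum_d I \nu_d^2 + (\lambda_d - \hat{\mu})(\lambda_d - \hat{\mu})^T$, 

where $n_d$ is the vector of word counts for document $d$. 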
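The M-step updates above reduce to weighted counts and moment averages. The following sketch assumes a particular data layout that the text does not specify: `phi[d][v][i]` holds the variational probability that term `v` in document `d` is assigned to topic `i`, `counts[d][v]` is the word-count vector $n_d$, and `lam[d]` and `nu2[d]` are the per-document Gaussian variational means and variances. The function name `m_step` is likewise hypothetical.

```python
def m_step(phi, counts, lam, nu2):
    """Return (beta, mu, sigma) from expected sufficient statistics."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    k = len(lam[0])

    # Topics: beta_i proportional to sum_d phi_{d,i} * n_d, then normalized.
    beta = [[0.0] * n_terms for _ in range(k)]
    for d in range(n_docs):
        for v in range(n_terms):
            for i in range(k):
                beta[i][v] += phi[d][v][i] * counts[d][v]
    # A topic with no mass is left as zeros rather than dividing by zero.
    beta = [[x / (sum(row) or 1.0) for x in row] for row in beta]

    # Mean: average of the variational means.
    mu = [sum(lam[d][i] for d in range(n_docs)) / n_docs for i in range(k)]

    # Covariance: average of diag(nu_d^2) plus the outer product of deviations.
    sigma = [[0.0] * k for _ in range(k)]
    for d in range(n_docs):
        diff = [lam[d][i] - mu[i] for i in range(k)]
        for i in range(k):
            sigma[i][i] += nu2[d][i] / n_docs
            for j in range(k):
                sigma[i][j] += diff[i] * diff[j] / n_docs
    return beta, mu, sigma


# Hypothetical toy statistics: D = 2 documents, V = 2 terms, K = 2 topics.
phi = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]]]
counts = [[2, 1], [2, 2]]
lam = [[0.0, 0.0], [2.0, 4.0]]
nu2 = [[1.0, 1.0], [3.0, 1.0]]
beta, mu, sigma = m_step(phi, counts, lam, nu2)
```

The covariance estimate includes the variational variances, so posterior uncertainty about each document's proportions is not discarded when the per-document means are pooled.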

The E-step and M-step are repeated until the bound on the likelihood 
converges. In the analysis reported below, we run variational inference until 
the relative change in the probability bound of equation (3) is less than 
0.0001%, and run variational EM until the relative change in the likelihood 
bound is less than 0.001%. 
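The stopping rule described here amounts to a relative-change test on successive values of the bound. The helper below is a generic sketch: `iterate_until_converged` and the toy `step` function are hypothetical names, and the toy sequence merely stands in for a real E-step/M-step sweep that returns the current bound.

```python
def relative_change(old, new):
    """Relative change between successive values of the bound."""
    return abs((new - old) / old)


def iterate_until_converged(step, tol, max_iter=10000):
    """Call step() repeatedly; stop once the returned bound changes by less than tol.

    Returns the final bound and the number of calls made to step().
    """
    old = step()
    for it in range(1, max_iter):
        new = step()
        if relative_change(old, new) < tol:
            return new, it + 1
        old = new
    return old, max_iter


INNER_TOL = 1e-6  # 0.0001%, for per-document variational inference
OUTER_TOL = 1e-5  # 0.001%, for variational EM


def make_toy_step():
    """Stand-in for one sweep: a bound that rises geometrically toward -100."""
    state = {"k": 0}

    def step():
        state["k"] += 1
        return -100.0 - 50.0 * 0.5 ** state["k"]

    return step


bound, n_calls = iterate_until_converged(make_toy_step(), OUTER_TOL)
```

In an actual fit the inner tolerance would govern the per-document loop inside each E-step and the outer tolerance would govern the EM loop around it.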

3.3. Topic graphs. As seen below, the ability of the CTM to model the 
correlation between topics yields a better fit of a document collection than 
LDA. But the covariance of the logistic normal model for topic proportions 
can also be used to visualize the relationships among the topics. In partic- 
ular, the covariance matrix can be used to form a topic graph, where the 
nodes represent individual topics, and neighboring nodes represent highly 
related topics. In such settings, it is useful to have a mechanism to control 
the sparsity of the graph. 

Recall that the graph encoding the independence relations in a Gaussian 
graphical model is specified by the zero pattern in the inverse covariance 
matrix. More precisely, if $X \sim N(\mu, \Sigma)$ is a $K$-dimensional multivariate 
Gaussian, and $S = \Sigma^{-1}$ denotes the inverse covariance matrix, then we form 
a graph $G(\Sigma) = (V, E)$ with vertices $V$ corresponding to the random 
variables $X_1, \ldots, X_K$ and edges $E$ satisfying $(s,t) \in E$ if and only if $S_{st} \neq 0$. If 
$\mathcal{N}(s) = \{t : (s,t) \in E\}$ denotes the set of neighbors of $s$ in the graph, then 
the independence relation $X_s \perp X_u \mid X_{\mathcal{N}(s)}$ holds for any node $u \notin \mathcal{N}(s)$ that 
is not a neighbor of $s$. 

Recent work of Meinshausen and Bühlmann [20] shows how the lasso 
[28] can be adapted to give an asymptotically consistent estimator of the 
graph $G(\Sigma)$. The strategy is to regress each variable $X_s$ onto all of the other 
variables, imposing an $\ell_1$ penalty on the parameters to encourage sparsity. 
The nonzero components then serve as an estimate of the neighbors of $s$ in 
the graph. 

In more detail, let $\kappa_s = (\kappa_{s1}, \ldots, \kappa_{sK}) \in \mathbb{R}^K$ be the parameters of the lasso 
fit obtained by regressing $X_s$ onto $(X_t)_{t \neq s}$, with the parameter $\kappa_{ss}$ serving 
as the unregularized intercept. The optimization problem is 

(5)  $\hat{\kappa}_s = \arg\min \frac{1}{2} \| X_s - X_{\backslash s} \kappa_s \|_2^2 + \rho_n \| \kappa_{\backslash s} \|_1$, 

where $X_{\backslash s}$ denotes the set of variables with $X_s$ replaced by the vector of 
all 1's, and $\kappa_{\backslash s}$ denotes the vector $\kappa_s$ with component $\kappa_{ss}$ removed. The 
estimated set of neighbors is then 

(6)  $\hat{\mathcal{N}}(s) = \{t : \hat{\kappa}_{st} \neq 0\}$. 

Meinshausen and Bühlmann [20] show that $P(\hat{\mathcal{N}}(s) = \mathcal{N}(s)) \to 1$ as the 
sample size $n$ increases, for a suitable choice of the regularization parameter 
$\rho_n$ satisfying $n\rho_n^2 - \log(K) \to \infty$. Moreover, the convergence is exponentially 
fast, and as a consequence, if $K = O(n^d)$ grows only polynomially with 
sample size, the estimated graph is the true graph with probability approaching 
one. 

To adapt the Meinshausen-Bühlmann technique to the CTM, recall that 
we estimate the covariance matrix $\Sigma$ using variational EM, where in the 
M-step we maximize the variational lower bound with respect to the 
approximation computed in the E-step. For a given document $d$, the variational 
approximation to the posterior of $\eta$ is a normal with mean $\lambda_d \in \mathbb{R}^K$. We 
treat the standardized mean vectors $\{\lambda_d\}$ as data, and regress each 
component onto the others with an $\ell_1$ penalty. Two simple procedures can be used 
to then form the graph edge set, by taking the conjunction or disjunction of 
the local neighborhood estimates: 

(7)  $(s,t) \in E_{\mathrm{AND}}$ in case $t \in \hat{\mathcal{N}}(s)$ and $s \in \hat{\mathcal{N}}(t)$, 

(8)  $(s,t) \in E_{\mathrm{OR}}$ in case $t \in \hat{\mathcal{N}}(s)$ or $s \in \hat{\mathcal{N}}(t)$. 

Figure 2 shows an example of a topic graph constructed using this method, 
with edges $E_{\mathrm{AND}}$ formed by intersecting the neighborhood estimates. Varying 
the regularization parameter $\rho_n$ allows control over the sparsity of the 
graph; the graph becomes increasingly sparse as $\rho_n$ increases. 
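The neighborhood-selection procedure can be sketched with a small coordinate-descent lasso. This is an illustration under stated assumptions, not the authors' code: the columns are centered and scaled first, so the unregularized intercept is zero and is dropped; the solver runs a fixed number of sweeps instead of testing convergence; and the function names and the synthetic data (two strongly related components and one unrelated one) are hypothetical.

```python
import math
import random


def soft_threshold(a, t):
    if a > t:
        return a - t
    if a < -t:
        return a + t
    return 0.0


def lasso_cd(cols, y, rho, n_sweeps=200):
    """Minimize 0.5 * ||y - sum_j cols[j] * coef[j]||^2 + rho * ||coef||_1."""
    n = len(y)
    coef = [0.0] * len(cols)
    resid = list(y)  # residual y - X coef, kept up to date
    norms = [sum(x * x for x in c) for c in cols]
    for _ in range(n_sweeps):
        for j, col in enumerate(cols):
            if norms[j] == 0.0:
                continue
            # Inner product of column j with the residual that excludes column j.
            g = sum(col[i] * resid[i] for i in range(n)) + norms[j] * coef[j]
            new = soft_threshold(g, rho) / norms[j]
            delta = new - coef[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * col[i]
                coef[j] = new
    return coef


def standardize(rows):
    """Return the columns of rows, centered and scaled to unit variance."""
    n = len(rows)
    cols = []
    for j in range(len(rows[0])):
        col = [r[j] for r in rows]
        m = sum(col) / n
        sd = math.sqrt(sum((x - m) ** 2 for x in col) / n) or 1.0
        cols.append([(x - m) / sd for x in col])
    return cols


def topic_graph(rows, rho):
    """Return the edge sets (E_AND, E_OR) from per-node lasso neighborhoods."""
    cols = standardize(rows)
    k = len(cols)
    nbrs = []
    for s in range(k):
        others = [t for t in range(k) if t != s]
        coef = lasso_cd([cols[t] for t in others], cols[s], rho)
        nbrs.append({t for t, c in zip(others, coef) if c != 0.0})
    e_and = {(s, t) for s in range(k) for t in nbrs[s] if s < t and s in nbrs[t]}
    e_or = {(min(s, t), max(s, t)) for s in range(k) for t in nbrs[s]}
    return e_and, e_or


# Hypothetical data standing in for the variational means lambda_d.
rng = random.Random(0)
rows = []
for _ in range(200):
    a = rng.gauss(0.0, 1.0)
    rows.append([a, a + 0.2 * rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)])
e_and, e_or = topic_graph(rows, rho=60.0)
```

Because the penalty in equation (5) multiplies an unnormalized sum of squares, a useful value of `rho` scales with the number of documents; raising it far enough removes every edge.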

4. Analyzing Science. JSTOR is an on-line archive of scholarly journals 
that scans bound volumes dating back to the 1600s and runs optical char- 
acter recognition algorithms on the scans. Thus, JSTOR stores and indexes 
hundreds of millions of pages of noisy text, all searchable through the Inter- 
net. This is an invaluable resource to scholars. 

The JSTOR collection provides an opportunity for developing exploratory 
analysis and useful descriptive statistics of large volumes of text data. As 
they are, the articles are organized by journal, volume and number. But 
the users of JSTOR would benefit from a topical organization of articles 
from different journals and automatic recommendations of similar articles 
to those known to be of interest. 

In some modern electronic scholarly archives, such as the ArXiv 
(http://www.arxiv.org/), contributors provide meta-data with their manu- 
scripts that describe and categorize their work to aid in such a topical ex- 
ploration of the collection. In many text data sets, however, meta-data is 
unavailable. Moreover, there may be underlying topics and connections be- 
tween articles that the authors or curators have not determined. To these 
ends, we analyzed a large portion of JSTOR's corpus of articles from Science 
with the CTM. 

4.1. Qualitative analysis of Science. In this section we illustrate the pos- 
sible applications of the CTM to automatic corpus analysis and browsing. 
We estimated a 100-topic model on the Science articles from 1990 to 1999 
using the variational EM algorithm of Section 3.2. (C code that implements 
this algorithm can be found at the first author's web-site and STATLIB.) 
The total vocabulary size in this collection is 375,144 terms. We trim the 
356,195 terms that occurred fewer than 70 times as well as 296 stop words, 
that is, words like "the," "but" or "with," which do not convey meaning. 
This yields a corpus of 16,351 documents, 19,088 unique terms and a total 
of 5.7M words. 

Using the technique described in Section 3.3, we constructed a sparse 
graph ($\rho = 0.1$) of the connections between the estimated latent topics. Part 
of this graph is illustrated in Figure 2. (For space, we manually removed 
topics that occurred very rarely and those that captured nontopical content 
such as front matter.) This graph provides a snapshot of ten years of 
Science, and reveals different substructures of themes in the collection. A user 
interested in the brain can restrict attention to articles that use the 
neuroscience topics; a user interested in genetics can restrict attention to those 
articles in the cluster of genetics topics. 

Fig. 2. A portion of the topic graph learned from 16,351 OCR articles from Science (1990-1999). Each topic node is labeled with its 
five most probable phrases and has font proportional to its popularity in the corpus. (Phrases are found by permutation test.) The full 
model can be found in http://www.cs.cmu.edu/~lemur/science/ and on STATLIB. 

Further structure is revealed at the document level, where each document 
is associated with a latent vector of topic proportions. The posterior distri- 
bution of the proportions can be used to associate documents with latent 
topics. For example, the following are the top five articles associated with 
the topic whose most probable vocabulary items are "laser, optical, light, 
electron, quantum": 



1. "Vacuum Squeezing of Solids: Macroscopic Quantum States Driven by 
Light Pulses" (1997). 

2. "Superradiant Rayleigh Scattering from a Bose-Einstein Condensate" (1999). 

3. "Physics and Device Applications of Optical Microcavities" (1992). 

4. "Photon Number Squeezed States in Semiconductor Lasers" (1992). 

5. "A Well-Collimated Quasi-Continuous Atom Laser" (1999). 

Moreover, we can use the expected distance between per-document topic 
proportions to identify other documents that have similar topical content. 
We use the expected Hellinger distance, which is a symmetric distance 
between distributions. Consider two documents $i$ and $j$, 

$E[d(\theta_i, \theta_j)] = E\left[\sum_k \left(\sqrt{\theta_{ik}} - \sqrt{\theta_{jk}}\right)^2\right] = 2 - 2 \sum_k E[\sqrt{\theta_{ik}}]\, E[\sqrt{\theta_{jk}}]$, 

where all expectations are taken with respect to the variational posterior 
distributions (see Section 3.1). One example of this application of the latent 
variable analysis is illustrated in Figure 3. 
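A sketch of the document-similarity computation follows. It assumes the squared form of the Hellinger distance, $d(\theta_i, \theta_j) = \sum_k (\sqrt{\theta_{ik}} - \sqrt{\theta_{jk}})^2$, and uses the fact that the two documents have independent variational posteriors, so the expected distance is $2 - 2 \sum_k E[\sqrt{\theta_{ik}}]\,E[\sqrt{\theta_{jk}}]$. Since $E[\sqrt{\theta_k}]$ has no closed form under a Gaussian on $\eta$, it is estimated here by Monte Carlo; that choice, the function names and the toy documents are assumptions of this example.

```python
import math
import random


def expected_sqrt_theta(lam, nu2, n_samples, rng):
    """Monte Carlo estimate of E[sqrt(theta_k)] when eta ~ N(lam, diag(nu2))."""
    k = len(lam)
    sd = [math.sqrt(v) for v in nu2]
    totals = [0.0] * k
    for _ in range(n_samples):
        eta = [rng.gauss(lam[i], sd[i]) for i in range(k)]
        m = max(eta)
        exps = [math.exp(e - m) for e in eta]
        norm = sum(exps)
        for i in range(k):
            totals[i] += math.sqrt(exps[i] / norm)
    return [t / n_samples for t in totals]


def expected_hellinger(doc_i, doc_j, n_samples=2000, rng=None):
    """Each document is a pair (lam, nu2) of variational means and variances."""
    rng = rng or random.Random(0)
    si = expected_sqrt_theta(doc_i[0], doc_i[1], n_samples, rng)
    sj = expected_sqrt_theta(doc_j[0], doc_j[1], n_samples, rng)
    return 2.0 - 2.0 * sum(a * b for a, b in zip(si, sj))


# Hypothetical documents over K = 3 topics.
optics = ([2.0, 0.0, -1.0], [0.3, 0.3, 0.3])
optics_like = ([1.8, 0.2, -1.0], [0.3, 0.3, 0.3])
genetics = ([-1.0, 0.0, 2.0], [0.3, 0.3, 0.3])
near = expected_hellinger(optics, optics_like)
far = expected_hellinger(optics, genetics)
```

Ranking all documents by this quantity relative to a query document gives the kind of nearest-article list shown in Figure 3.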

The interested reader is invited to visit http://www.cs.cmu.edu/~lemur/science/ 
to interactively explore this model, including the topics, their connections, 
the articles that exhibit them and the expected Hellinger similarity between 
articles. 

Fig. 3. Using the Hellinger distance to find similar articles to the query article "Earth's 
Solid Iron Core May Skew Its Magnetic Field." Illustrated are the top three articles by 
Hellinger distance to the query article and the expected posterior topic proportions for 
each article. Notice that each document somehow combines geology and physics. 

4.2. Quantitative comparison to latent Dirichlet allocation. We compared 
the logistic normal to the Dirichlet by fitting a smaller collection of articles 
to CTM and LDA models of varying numbers of topics. This collection 
contains the 1,452 documents from 1960; we used a vocabulary of 5,612 words 
after pruning common function words and terms that occur once in the 
collection. Using ten-fold cross validation, we computed the log probability of 
the held-out data given a model estimated from the remaining data. A 
better model of the document collection will assign higher probability to the 
held out data. To avoid comparing bounds, we used importance sampling 
to compute the log probability of a document where the fitted variational 
distribution is the proposal. 
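The importance-sampling estimate can be sketched as follows. Samples of $\eta$ are drawn from the fitted Gaussian variational distribution, the topic assignments are summed out analytically for each sample, and the weights $p(\eta \mid \mu, \Sigma)\, p(w \mid \eta, \beta) / q(\eta)$ are averaged. The function names, the choice to pass the Cholesky factor of $\Sigma$ as an argument, and the toy inputs are assumptions made for illustration; the source does not give these implementation details.

```python
import math
import random


def log_gauss(x, mean, chol):
    """log N(x; mean, L L^T), with L = chol lower triangular (forward substitution)."""
    k = len(x)
    y = []
    for i in range(k):
        s = x[i] - mean[i] - sum(chol[i][j] * y[j] for j in range(i))
        y.append(s / chol[i][i])
    log_det = 2.0 * sum(math.log(chol[i][i]) for i in range(k))
    return -0.5 * (k * math.log(2 * math.pi) + log_det + sum(v * v for v in y))


def log_lik_given_eta(words, eta, beta):
    """log p(w | eta, beta) with the topic assignments summed out."""
    m = max(eta)
    exps = [math.exp(e - m) for e in eta]
    total = sum(exps)
    theta = [e / total for e in exps]
    return sum(math.log(sum(theta[i] * beta[i][w] for i in range(len(beta))))
               for w in words)


def is_log_prob(words, beta, mu, chol_sigma, lam, nu2, n_samples, rng):
    """Importance-sampling estimate of log p(w | mu, Sigma, beta)."""
    k = len(mu)
    sd = [math.sqrt(v) for v in nu2]
    log_weights = []
    for _ in range(n_samples):
        eta = [rng.gauss(lam[i], sd[i]) for i in range(k)]
        log_q = sum(-0.5 * math.log(2 * math.pi * nu2[i])
                    - 0.5 * (eta[i] - lam[i]) ** 2 / nu2[i] for i in range(k))
        log_weights.append(log_gauss(eta, mu, chol_sigma)
                           + log_lik_given_eta(words, eta, beta) - log_q)
    # Average the weights in log space for numerical stability.
    m = max(log_weights)
    return m + math.log(sum(math.exp(v - m) for v in log_weights) / n_samples)


# Hypothetical toy inputs: K = 2 topics, V = 3 terms, identity covariance.
beta = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
identity = [[1.0, 0.0], [0.0, 1.0]]
log_p = is_log_prob([0, 2, 2, 1], beta, [0.0, 0.0], identity,
                    lam=[0.0, 0.5], nu2=[0.5, 0.5], n_samples=2000,
                    rng=random.Random(0))
```

Unlike the variational bound, this estimate converges to the true log probability as the number of samples grows, which is why it permits a comparison across models that does not depend on how tight each model's bound is.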

Figure 4 illustrates the average held out log probability for each model 
and the average difference between them. The CTM provides a better fit 
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Fig. 4. (Left) T/ie 10-fold cross-validated held-out log probability of the 1960 Science 
corpus, computed by importance sampling. The CTM supports more topics than LDA. See 
figure at right for the standard error of the difference. (Right) The mean difference in 
held-out log probability. Numbers greater than zero indicate a better fit by the CTM. 



than LDA and supports more topics; the likelihood for LDA peaks near 30 
topics, while the likelihood for the CTM peaks close to 90 topics. The means 
and standard errors of the difference in log-likelihood of the models are shown 
at right; this indicates that the CTM always gives a better fit. 

Another quantitative evaluation of the relative strengths of LDA and the 
CTM is how well the models predict the remaining words of a document after 
observing a portion of it. Specifically, we observe P words from a document 
and are interested in which model provides a better predictive distribution 
of the remaining words $p(w \mid w_{1:P})$. To compare these distributions, we use 
perplexity, which can be thought of as the effective number of equally likely 
words according to the model. Mathematically, the perplexity of a word 
distribution is defined as the inverse of the per-word geometric average of 
the probability of the observations, 

$$\mathrm{Perp}(\Phi) = \Big( \prod_{d=1}^{D} \prod_{i=P+1}^{N_d} p(w_i \mid \Phi, w_{1:P}) \Big)^{-1 / \sum_{d=1}^{D} (N_d - P)},$$

where $\Phi$ denotes the model parameters of an LDA or CTM model. Note 
that lower numbers denote more predictive power. 
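
To make the definition concrete, the short sketch below computes perplexity from per-word log probabilities. It assumes the predictive log probabilities of the unobserved words have already been computed by some model; the function name and input layout are illustrative choices.

```python
import math

def perplexity(log_probs_per_doc):
    """Inverse per-word geometric mean of the predictive probabilities.

    log_probs_per_doc[d] is assumed to hold the log predictive probability
    of each unobserved word (positions P+1, ..., N_d) of document d.
    """
    total_log_prob = sum(sum(doc) for doc in log_probs_per_doc)
    total_words = sum(len(doc) for doc in log_probs_per_doc)
    return math.exp(-total_log_prob / total_words)
```

A model that spreads its mass uniformly over V words has perplexity V, which is the sense in which perplexity counts "equally likely words".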

The plot in Figure 5 compares the predictive perplexity under LDA and 
the CTM for different numbers of words randomly observed from the doc- 
uments. When a small number of words have been observed, there is less 
uncertainty about the remaining words under the CTM than under LDA — 
the perplexity is reduced by nearly 200 words, or roughly 10%. The reason 
Fig. 5. (Left) The 10-fold cross-validated predictive perplexity for partially observed 
held-out documents from the 1960 Science corpus (K = 50). Lower numbers indicate more 
predictive power from the CTM. (Right) The mean difference in predictive perplexity. 
Numbers less than zero indicate better prediction from the CTM. 



is that after seeing a few words in one topic, the CTM uses topic correlation 
to infer that words in a related topic may also be probable. In contrast, 
LDA cannot predict the remaining words as well until a large portion of the 
document has been observed so that all of its topics are represented. 



5. Summary. We have developed a hierarchical topic model of docu- 
ments that replaces the Dirichlet distribution of per-document topic pro- 
portions with a logistic normal. This allows the model to capture correla- 
tions between the occurrence of latent topics. The resulting correlated topic 
model gives better predictive performance and uncovers interesting descrip- 
tive statistics for facilitating browsing and search. Use of the logistic normal, 
while more complex, may have benefit in the many applications of Dirichlet- 
based mixed membership models. 

One issue that we did not thoroughly explore is model selection, that is, 
choosing the number of topics for a collection. In other topic models, non- 
parametric Bayesian methods based on the Dirichlet process are a natural 
suite of tools because they can accommodate new topics as more documents 
are observed. (The nonparametric Bayesian version of LDA is exactly the 
hierarchical Dirichlet process [27].) The logistic normal, however, does not 
immediately give way to such extensions. Tackling the model selection issue 
in this setting is an important area of future research. 

APPENDIX: DETAILS OF VARIATIONAL INFERENCE 



Variational objective. Before deriving the optimization procedure, we 
put the objective function equation (3) in terms of the variational parameters. 
The first term is 

$$\mathrm{E}_q[\log p(\eta \mid \mu, \Sigma)] = \tfrac{1}{2} \log |\Sigma^{-1}| - \tfrac{K}{2} \log 2\pi - \tfrac{1}{2} \mathrm{E}_q[(\eta - \mu)^T \Sigma^{-1} (\eta - \mu)], \tag{9}$$

where 

$$\mathrm{E}_q[(\eta - \mu)^T \Sigma^{-1} (\eta - \mu)] = \mathrm{Tr}(\mathrm{diag}(\nu^2) \Sigma^{-1}) + (\lambda - \mu)^T \Sigma^{-1} (\lambda - \mu). \tag{10}$$


The nonconjugacy of the logistic normal to multinomial leads to difficulty 
in computing the second term of equation (3), the expected log probability 
of a topic assignment 

$$\mathrm{E}_q[\log p(z_n \mid \eta)] = \mathrm{E}_q[\eta^T z_n] - \mathrm{E}_q\Big[\log\Big(\sum_{i=1}^{K} \exp\{\eta_i\}\Big)\Big]. \tag{11}$$


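To preserve the lower bound on the log probability, we upper bound the 
log normalizer (equivalently, lower bound its negative) with a first-order 
Taylor expansion: 

$$\mathrm{E}_q\Big[\log\Big(\sum_{i=1}^{K} \exp\{\eta_i\}\Big)\Big] \le \zeta^{-1}\Big(\sum_{i=1}^{K} \mathrm{E}_q[\exp\{\eta_i\}]\Big) - 1 + \log(\zeta), \tag{12}$$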



where we have introduced a new variational parameter $\zeta$. The expectation 
$\mathrm{E}_q[\exp\{\eta_i\}]$ is the mean of a log normal distribution with mean and variance 
obtained from the variational parameters $\{\lambda_i, \nu_i^2\}$: $\mathrm{E}_q[\exp\{\eta_i\}] = \exp\{\lambda_i + \nu_i^2/2\}$ 
for $i \in \{1, \ldots, K\}$. This is a simpler approach than the more flexible, 
but more computationally intensive, method taken in [25]. 
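
The bound on the log normalizer can be checked numerically with a small sketch. It relies on the tangent inequality log(x) <= x / zeta - 1 + log(zeta), which holds for any zeta > 0 because the logarithm is concave, together with the log normal mean stated above. The function names are illustrative assumptions.

```python
import math

def expected_exp(lam, nu2):
    """E_q[exp(eta_i)] for eta_i ~ N(lam_i, nu2_i): the log normal mean."""
    return [math.exp(l + v / 2.0) for l, v in zip(lam, nu2)]

def log_normalizer_upper_bound(lam, nu2, zeta):
    """Upper bound on E_q[log sum_i exp(eta_i)] for a given zeta > 0.

    Applies log(x) <= x / zeta - 1 + log(zeta) inside the expectation,
    then uses linearity of expectation on the sum of exponentials.
    """
    return sum(expected_exp(lam, nu2)) / zeta - 1.0 + math.log(zeta)
```

When all variances are zero the expectation is exact, and choosing zeta equal to the sum of the exponentials makes the bound tight; any other zeta gives a looser value.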

Using this additional bound, the second term of equation (3) is 

$$\mathrm{E}_q[\log p(z_n \mid \eta)] = \sum_{i=1}^{K} \lambda_i \phi_{n,i} - \zeta^{-1}\Big(\sum_{i=1}^{K} \exp\{\lambda_i + \nu_i^2/2\}\Big) + 1 - \log \zeta. \tag{13}$$



The third term of equation (3) is 



$$\mathrm{E}_q[\log p(w_n \mid z_n, \beta)] = \sum_{i=1}^{K} \phi_{n,i} \log \beta_{i,w_n}. \tag{14}$$


The fourth term is the entropy of the variational distribution: 

$$\sum_{i=1}^{K} \tfrac{1}{2} (\log \nu_i^2 + \log 2\pi + 1) - \sum_{n=1}^{N} \sum_{i=1}^{K} \phi_{n,i} \log \phi_{n,i}. \tag{15}$$


Note that the additional variational parameter $\zeta$ is not needed to compute 
this entropy. 
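
Putting the four terms together, a per-document bound might be evaluated as in the sketch below. It assumes that the objective in equation (3) sums the topic-assignment and word terms over the N words of the document, and that the inverse covariance and its log determinant are supplied precomputed; the function name and argument layout are hypothetical.

```python
import math

def document_bound(words, beta, mu, sigma_inv, log_det_sigma_inv,
                   lam, nu2, phi, zeta):
    """Variational lower bound for one document (sketch of terms 9-15).

    words: word indices; beta: K x V topic matrix; mu: prior mean;
    sigma_inv: K x K inverse covariance (list of lists);
    lam, nu2: variational mean and variances; phi: N x K; zeta > 0.
    """
    K, N = len(lam), len(words)
    diff = [lam[i] - mu[i] for i in range(K)]
    quad = sum(diff[i] * sigma_inv[i][j] * diff[j]
               for i in range(K) for j in range(K))
    trace = sum(nu2[i] * sigma_inv[i][i] for i in range(K))
    # Term 1: expected log prior of eta.
    term1 = (0.5 * log_det_sigma_inv - 0.5 * K * math.log(2 * math.pi)
             - 0.5 * (trace + quad))
    # Term 2: bounded expected log probability of topic assignments.
    sum_exp = sum(math.exp(lam[i] + nu2[i] / 2.0) for i in range(K))
    term2 = sum(lam[i] * phi[n][i] for n in range(N) for i in range(K))
    term2 += N * (-sum_exp / zeta + 1.0 - math.log(zeta))
    # Term 3: expected log probability of the words.
    term3 = sum(phi[n][i] * math.log(beta[i][words[n]])
                for n in range(N) for i in range(K) if phi[n][i] > 0)
    # Term 4: entropy of the Gaussian and multinomial factors.
    gauss_entropy = sum(0.5 * (math.log(nu2[i]) + math.log(2 * math.pi) + 1.0)
                        for i in range(K))
    phi_entropy = -sum(p * math.log(p) for row in phi for p in row if p > 0)
    return term1 + term2 + term3 + gauss_entropy + phi_entropy
```

With a single topic the true log probability is simply the sum of log beta over the words, so the gap between that value and the returned bound shows the slack introduced by the Taylor bound.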
Coordinate ascent optimization. Finally, we maximize the bound in equation (3) 
with respect to the variational parameters $\lambda_{1:K}$, $\nu_{1:K}$, $\phi_{1:N}$ and $\zeta$. 
We use a coordinate ascent algorithm, iteratively maximizing the bound 
with respect to each parameter. 

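First, we maximize equation (3) with respect to $\zeta$, using the second bound 
in equation (12). The derivative with respect to $\zeta$ is 

$$f'(\zeta) = N \Big( \zeta^{-2} \Big( \sum_{i=1}^{K} \exp\{\lambda_i + \nu_i^2/2\} \Big) - \zeta^{-1} \Big), \tag{16}$$

which vanishes at the maximizer 

$$\hat{\zeta} = \sum_{i=1}^{K} \exp\{\lambda_i + \nu_i^2/2\}. \tag{17}$$
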

Second, we maximize with respect to $\phi_n$. This yields a maximum at 

$$\hat{\phi}_{n,i} \propto \exp\{\lambda_i\} \beta_{i,w_n}, \qquad i \in \{1, \ldots, K\}, \tag{18}$$

which is an application of variational inference updates within the exponential 
family [5, 7, 31]. 
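
The two closed-form updates, for zeta and for phi, are simple enough to sketch directly. The function names are illustrative; subtracting the maximum of lambda before exponentiating is an added numerical safeguard that leaves the normalized result unchanged.

```python
import math

def update_zeta(lam, nu2):
    """Closed-form maximizer of the bound in zeta (equation 17)."""
    return sum(math.exp(l + v / 2.0) for l, v in zip(lam, nu2))

def zeta_derivative(zeta, lam, nu2, num_words):
    """Derivative of the bound with respect to zeta (equation 16)."""
    s = update_zeta(lam, nu2)
    return num_words * (s / zeta ** 2 - 1.0 / zeta)

def update_phi(words, lam, beta):
    """Per-word topic responsibilities, phi[n][i] proportional to
    exp(lam[i]) * beta[i][w_n] (equation 18), normalized over topics."""
    shift = max(lam)  # stability only; cancels in the normalization
    phi = []
    for w in words:
        unnorm = [math.exp(l - shift) * row[w] for l, row in zip(lam, beta)]
        total = sum(unnorm)
        phi.append([u / total for u in unnorm])
    return phi
```

The derivative is positive below the update value and negative above it, which is a quick way to confirm that the stationary point is a maximum.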

Third, we maximize with respect to $\lambda_i$. Equation (3) is not amenable 
to analytic maximization. We use the conjugate gradient algorithm with 
derivative 

$$dL/d\lambda = -\Sigma^{-1}(\lambda - \mu) + \sum_{n=1}^{N} \phi_{n,1:K} - (N/\zeta) \exp\{\lambda + \nu^2/2\}. \tag{19}$$


Finally, we maximize with respect to $\nu_i^2$. Again, there is no analytic solution. 
We use Newton's method for each coordinate with the constraint that $\nu_i > 0$, 

$$dL/d\nu_i^2 = -\Sigma^{-1}_{ii}/2 - (N/(2\zeta)) \exp\{\lambda_i + \nu_i^2/2\} + 1/(2\nu_i^2). \tag{20}$$
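
The two gradients can be sanity-checked against finite differences. The sketch below writes out only the part of the bound that depends on lambda and nu^2 (constants and the phi entropy are dropped), assuming the per-document objective sums the topic-assignment term over N words. `phi_sums[i]` stands for the sum over words of phi[n][i]; all names are illustrative, and a generic optimizer would be supplied separately.

```python
import math

def partial_bound(lam, nu2, mu, sigma_inv, phi_sums, zeta, num_words):
    """Terms of the bound that depend on lam and nu2 (constants dropped)."""
    K = len(lam)
    diff = [lam[i] - mu[i] for i in range(K)]
    quad = sum(diff[i] * sigma_inv[i][j] * diff[j]
               for i in range(K) for j in range(K))
    trace = sum(nu2[i] * sigma_inv[i][i] for i in range(K))
    sum_exp = sum(math.exp(lam[i] + nu2[i] / 2.0) for i in range(K))
    return (-0.5 * (quad + trace)
            + sum(lam[i] * phi_sums[i] for i in range(K))
            - (num_words / zeta) * sum_exp
            + 0.5 * sum(math.log(v) for v in nu2))

def grad_lambda(lam, nu2, mu, sigma_inv, phi_sums, zeta, num_words):
    """Gradient with respect to lam; assumes sigma_inv is symmetric."""
    K = len(lam)
    return [-sum(sigma_inv[i][j] * (lam[j] - mu[j]) for j in range(K))
            + phi_sums[i]
            - (num_words / zeta) * math.exp(lam[i] + nu2[i] / 2.0)
            for i in range(K)]

def grad_nu2(lam, nu2, sigma_inv, zeta, num_words):
    """Gradient with respect to each variance nu2[i] (must be positive)."""
    return [-0.5 * sigma_inv[i][i]
            - (num_words / (2.0 * zeta)) * math.exp(lam[i] + nu2[i] / 2.0)
            + 1.0 / (2.0 * nu2[i])
            for i in range(len(lam))]
```

These functions could be handed to any gradient-based routine; optimizing over log(nu2[i]) instead of nu2[i] is one common way to respect the positivity constraint, though that reparameterization is a design choice not taken from the text.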

Iterating between the optimizations of $\nu$, $\lambda$, $\phi$ and $\zeta$ defines a coordinate 
ascent algorithm on equation (3). (In practice, we optimize with respect 
to $\zeta$ in between optimizations for $\nu$, $\lambda$ and $\phi$.) Though each coordinate's 
optimization is convex, the variational objective is not convex with respect 
to the ensemble of variational parameters. We are only guaranteed to find a 
local maximum, but note that this is still a bound on the log probability of 
a document. 